COGS 108

Video

https://drive.google.com/file/d/1Jn3pSgIE1aye5TuL6EFp6uR7X6AJudox/view?usp=sharing

Permissions

Place an X in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

Overview

In this project, we investigate the relationship between spending on health, education, and home security and crime rates, and conclude that there is little correlation between them. We fit linear regression lines using ordinary least squares and visualized them on scatter plots. The regression lines and visualizations show that crime rates have a weak negative correlation with health, education, and home security spending, respectively. We then applied linear regression, support vector regression with radial basis function and polynomial kernels, ridge regression, and Poisson regression, and concluded that there is no relationship between our variables of interest.

Names

Research Question

Is there a relationship between health, education, and house security spending and all different categories of crime rates in all 441 valid FSIP areas in San Diego County?

Background & Prior Work

What background information led to our research topic

In his economic theory of crime (1968), Becker argued that people resort to crime only if the costs of committing the crime are lower than the benefits gained. However, crime is much more prevalent in poor, disadvantaged neighborhoods than in wealthy and middle-class neighborhoods, even though wealthier neighborhoods are more likely to contain valuable possessions. One potential explanation for this unexpected pattern can be found in the article "Why Disadvantaged Neighborhoods are More Attractive Targets for Burgling than Wealthy Ones." The authors, Alyssa W. Chamberlain and Lyndsay N. Boggess, claim that wealthier communities have lower burglary rates because burglars tend to live farther away from, and are unfamiliar with, wealthier neighborhoods. Instead, burglars are more likely to target the disadvantaged neighborhoods where they live, since familiarity lowers the risk of committing a crime there. In reality, most people want to live in a "safer" neighborhood, and most "safer" neighborhoods tend to have relatively higher rents or housing prices. This motivation is one of the reasons people often associate the crime rate with the wealth level of a community. Since an individual's wealth level influences many decisions, it is often measured by various factors, including salary, education expenditure, medical spending, debts, home security system spending, and other aspects. Among these expenditures, health, education, and home security system spending in everyday life are some of the most critical factors determining one's wealth level. Therefore, in this project, we aim to find the relationship between the wealth level of communities (more specifically, the health, education, and home security system spending of people living in those communities) and the crime rate.

Since wealth level is determined by a combination of sub-factors, we narrow our focus from wealth level to expenditure on health, education, and home security systems (as expenditure on these is often considered among the most fundamental contributors to measuring one's wealth). From the report "How does Health Spending in the U.S. Compare to Other Countries," we find that the United States spent about 11,946 dollars per capita on health consumption, far more than any other country. Japan, for example, spent only 4,691 dollars per capita on health consumption, less than half of the United States' figure. Given that the United States has the highest GDP in the world, we take medical spending as a reasonable reflection of wealth level. Similarly, we observe that education plays a decisive role in economic performance. People with higher education levels often earn higher salaries than those with lower levels of education. More generally, richer countries tend to have more educated populations, which also contributes to economic growth at a national level. Furthermore, there are parallels between home security and one’s financial standing, since people living in wealthy communities are more likely to spend money on home security systems. As a result, we believe that wealth level can be reasonably reflected by spending on health, education, and home security systems.

It is impossible to incorporate all the data of each sub-factor contributing to one's wealth level or determine an authoritative and exhaustive metric that fully represents the variable. We eventually decide to take health, education, and home security system spending to determine one's wealth level. We are also aware that there might be confounding variables in our study of finding the relationship between these three spending factors and the crime rate. Thus, determining such a relationship is only our first step, possibly with separate study cases for each confounding variable.

Why is this topic important

A growing body of research has shown that most people with criminal records have serious health care needs, especially with a history of mental illness or psychological distress, as well as a lack of education. As a result, this prevalence of mental illness and lack of education in the criminal justice population has led the government to adopt the thinking that better access to health care and education helps reduce crime. Apart from that, it is not uncommon to acknowledge that crimes are much more prevalent among poor, disadvantaged neighborhoods than among wealthy and middle-class neighborhoods, where home security systems are better. While studies have proven that an increase in health, education, and home security system spending would cause a crime reduction, little research has been done focusing on the relationship between the three spending factors and the crime rate of communities. Taking it as our research interest, we believe that if we could find a relationship between them, we could utilize such findings to reduce the crime rate to the greatest extent by predicting the incidence of crime in each community. Therefore, governments or organizations can enact more restrictive laws and send more police force to those communities with higher estimates.

In addition to the passive reduction of crimes, we could take advantage of this finding and solve the issue from the root. By discovering communities with significantly low health, education, and home security system spending, the government could open more treatment facilities, schools, and security offices in such areas and make related expenditures more affordable. With better access to health, education, and home security systems serving as the first step, we could gradually improve the entire community's well-being, both economically and socially.

Other concerns

In addition, we notice that the distribution of wealth level is not normal (rather more right-skewed), whereas the distribution of three spendings is approximately normal. Therefore, in this project, it will be reasonable for us to use a normal distribution to approximate the spending on health, education, and home security systems.

It is easy to find datasets about health, education, and home security system spending and crime rates. Nevertheless, La Jolla is too small for us to get a large enough dataset (because most of the data are counted by constituency). Thus, we decide to treat San Diego County as the base and choose our data within this larger range.

Esri is one of the biggest data holders in the world. While it does not create or record data, it holds data for the federal government, state government, huge corporations, organizations, and individuals. ArcGIS Online is one of its tools for visualizing and manipulating its data. Taking advantage of its USA Census data and database, which contains detailed census data for every constituency, such as different crime indices, household income, and health information, we can establish different models and data frames to visualize. From that, comparisons between multiple categories can then be utilized to reveal the relationship between health, education, and home security system spending and crime rates.

References (include links):

Hypothesis

Hypothesis:

The higher a sector's average household health / education / home security system spending, the lower that sector's crime rate.

Defence:

Dataset

Basic Information

Description:

This dataset provides information about general health care spending, educational spending, and home security system services in San Diego County in 2021. It also includes data describing the crime rate in the same year, including crimes such as murder, rape, robbery, assault, property crime, burglary, larceny, and motor vehicle theft. This dataset will be used to determine whether the community’s spending on the above three categories (general health care, education, and home security system services) is associated with the crime rate in the community, and in particular whether that association is negative.

Setup

If you have not installed the plotly and sklearn packages, please run the cell below to install them.

To better perform our data analysis task and answer our research question, additional functionalities outside what is included in Python by default are required. We import the following useful packages using their common shortened names (i.e., patsy, NumPy, pandas, seaborn, etc.).

Data Cleaning

The script below fetches the dataset needed to run this Jupyter notebook. Since we have a CSV file ready on GitHub for our Spending-Crime Data, we can directly import the data file using the URL. We import our dataset into a DataFrame called df.

Since our research questions focus on whether there is a relationship between health, education, and house security spending and all different categories of crime rate in all 535 valid FSIP areas in San Diego County, we are only interested in variables relative to health, education, housing security, and crime. Thus, the first step in data cleaning is only to extract or include information about these four significant variables. As a result, we have decided to remove other irrelevant or useless columns, including county, state, ID, country code, etc.

After removing the irrelevant columns, we look at the remaining ones. Since some of the variable names are relatively long compared to others, we decided to rename all of them into a more standard form. We create a function called col_clean (which takes a string as its input parameter) that helps us standardize the messy column titles. After applying it, our new column names are in lowercase letters with an underscore separating each word.
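The notebook's own col_clean is not reproduced here, so the following is only a minimal sketch of what such a function might look like. The raw column titles are hypothetical examples, and the exact cleaning rules (lowercasing, collapsing punctuation and spaces into underscores) are assumptions based on the description above.

```python
import re


def col_clean(name: str) -> str:
    """Standardize a column title: lowercase words joined by underscores."""
    name = name.strip().lower()
    # Replace every run of non-alphanumeric characters with one underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    # Drop any underscore left over at either end.
    return name.strip("_")


# Hypothetical raw titles, for illustration only.
raw_columns = ["Avg Health Care", "Total Crime (Aggregate)", "POP2020"]
clean_columns = [col_clean(c) for c in raw_columns]
print(clean_columns)
```

With a pandas DataFrame, a function like this could be applied to every column title at once through df.rename(columns=col_clean).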

After inspecting the current dataset (with all irrelevant columns excluded), we found some missing values in the POP2020 column. Since no population is recorded for the areas with missing values, it is not meaningful to include those areas along with their corresponding health, education, and housing security spending information. Therefore, to prevent them from introducing bias and outliers into our analysis, we have decided to remove all rows that contain missing values. After this, we no longer need the POP2020 column, so we drop it.

We check our dataframe again.

We want an overview of each variable in the df dataset, so we use the describe method to get descriptive statistics for all variables.

After briefly cleaning the dataset, we plot the overall distribution of each variable of interest. By plotting histograms for average health care spending, average education spending, average home security system services spending, and total crime aggregate, we get a better idea of what the distributions look like visually. The graphs below show that all four distributions are approximately normal but slightly right-skewed.

As we can see from the graphs above, there are some outliers on the far right end of the distributions for all histograms. We define data points as outliers if they are at least two standard deviations away from the mean. While outliers can sometimes be very informative about the data collection process, they can also increase the variability in our data, which decreases statistical power. Furthermore, removing outliers can reduce the least-squares error in our regression analysis and keep a few extreme areas from dominating the fitted line. For these reasons, we identified the outliers of avg_health_care, avg_education, avg_home_security_system_svcs, and total_crime_aggregate and removed them from our dataset. We remove the outliers from crime aggregate because we lack the data of the average for crime rate.
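The two-standard-deviation rule described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the toy values are made up, only two of the four columns are shown, and the sample standard deviation (n - 1 denominator, which is also the pandas default) is assumed.

```python
from statistics import mean, stdev


def outlier_mask(values, n_sd=2.0):
    """Flag each value that lies at least n_sd standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) >= n_sd * s for v in values]


# Hypothetical toy data: the last area has extreme health care spending.
data = {
    "avg_health_care": [4, 5, 6, 5, 4, 6, 5, 5, 6, 50],
    "total_crime_aggregate": [10, 12, 11, 13, 12, 10, 11, 12, 13, 11],
}

# A row is dropped if it is an outlier in any of the checked columns.
n_rows = len(data["avg_health_care"])
drop = [False] * n_rows
for column in data.values():
    drop = [d or flag for d, flag in zip(drop, outlier_mask(column))]

cleaned = {
    name: [v for v, d in zip(column, drop) if not d]
    for name, column in data.items()
}
print(len(cleaned["avg_health_care"]), "rows kept out of", n_rows)
```

Note that a two-standard-deviation cutoff is fairly aggressive: for roughly normal data it removes about 5% of the observations in each column even when nothing is wrong with them.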

After removing the outliers, we standardize the data, which puts the different variables on the same scale and allows further analysis. Following the z-score formula, we transform each value by subtracting the mean and then dividing by the standard deviation. In the end, we assign these new values to t_standardized_data and save them for future use.
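As a small illustration of the z-score transformation, z = (x - mean) / standard deviation, the sketch below standardizes one made-up column. The values are hypothetical, and the use of the sample standard deviation is an assumption; the notebook may use a different convention.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: each value minus the mean, divided by the standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical spending values, for illustration only.
spending = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(spending)

# After standardization the column has mean 0 and standard deviation 1
# (up to floating-point error).
print(round(mean(z_scores), 10), round(stdev(z_scores), 10))
```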

Data Analysis & Results

Since we already cleaned our dataset by removing all irrelevant columns, missing values, and outliers as well as performing the z-score standardization in the Data Cleaning section, now we can finally start analyzing our data.

To better visualize our clean data on health care spending, education spending, home security system service spending, and total crime aggregate, we apply the sns.histplot command to plot the overall distribution for each of them.

However, as we have observed from the previous four histograms, the range of the x-axis appears to be extremely large. (For health care spending, the range lies in between 0 to 25,000,000 dollars; for education spending, the range lies in between 0 to 8,000,000 dollars; for home security system spending, the range lies in between 0 to 175,000 dollars; for total crime aggregate, the range lies in between 0 to 800,000).

Therefore, to make the histograms more reader-friendly, we decide to rescale the x-axis by changing the units. For the health_care_standardized histogram, we rescale the x-axis with units in 1 million dollars. For the education_standardized histogram, we rescale the x-axis with units in 1,000 dollars. For the home_security_standardized histogram, we rescale the x-axis with units in 1,000 dollars. For the aggregate_crime_standardized histogram, we rescale the x-axis with units in 1,000.

After plotting the four histograms, we can observe that they now look much more readable compared to the ones before applying the unit change. Meanwhile, since we’re using the same dataset (with a slight change in units), the graphs still preserve the same approximately normal distributions. From a statistical point of view, an approximately normal shape is plausible in light of the central limit theorem, which states that the sum or mean of many independent observations approaches a normal distribution as the number of observations grows. In our case, each area's figure aggregates the spending of many households, so the theorem offers a reasonable explanation; it does not, however, guarantee that a variable becomes normal merely because the dataset contains many areas.

While the above figures provide us with a brief visualization of our dataset by displaying the overall distributions of the major four variables of interest (health care spending, education spending, home security system service spending, total crime aggregate), we still need further analysis to investigate the relationship between them.

As we learned in class, scatterplots are best used to determine whether or not two variables have a relationship or correlation. Besides, since all of our values are ratio scales, we can then pick two variables each time and plot them on a scatter diagram to view their relationship. Recall that our research question focuses on whether there is a relationship between health, education, and house security system service spending and all different categories of crime rate in all 535 valid FSIP areas in San Diego County. That is, we are mostly interested in the correlation between each of the three spending variables and the crime rate.

Thus, in the script below, we apply the sns.regplot command to generate three scatterplots (each with a corresponding least-squares regression line) that indicate the relationship between each factor and the total crime aggregate: health care spending vs. total crime aggregate, education spending vs. total crime aggregate, and home security system spending vs. total crime aggregate.

After taking a closer look at each scatterplot, though the points seem to be random, by drawing the regression line, we observe a slight negative slope between each of the three factors and the total crime aggregate. This implies that the more spending on health care, education, and home security system in a valid FSIP area in San Diego County, the less total crime aggregate in that same area. This discovery also makes sense intuitively because, in reality, crime is much more prevalent among poor, disadvantaged neighborhoods (with less spending on health care, education, and home security system) than among wealthy and middle-class neighborhoods (with more spending on health care, education, and home security system). With such background knowledge, we could verify that our conclusion drawn from the below scatterplots makes sense.

From the three OLS regression models, we find that although all three graphs show a negative slope, the R-squared values for all three models are very small.

To extract more evidence that supports our previous claim, we would like to further investigate the correlation between each of the three factors and the total crime aggregate. This time, we use the standardized data from the t_standardized_data created in the Data Cleaning section.

Similar to the linear pattern in the above scatterplots on “unstandardized” data, we again observe a slight negative slope between each of the three factors and the t standardized crime. If we take a more careful look at each scatterplot, we notice that the regression lines of corresponding scatterplots (unstandardized data vs. standardized data) look the same, since we use the same dataset only with a rescale. The numerical value of the slope does change with the units, but the sign of the slope, the correlation coefficient, and the R-squared do not.
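To make the effect of standardization on a simple regression concrete, the sketch below fits a least-squares line to made-up data before and after computing z-scores. The data and helper names are hypothetical. It illustrates a general property of simple linear regression: after standardizing both variables, the slope equals the Pearson correlation r, while r itself (and therefore R-squared) is unchanged by the rescaling.

```python
from statistics import mean, stdev


def ols_slope(x, y):
    """Least-squares slope of y on x for a simple linear regression."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def standardize(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data: spending (in millions) and a crime aggregate.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [900.0, 700.0, 950.0, 600.0, 800.0, 550.0]

raw_slope = ols_slope(x, y)
std_slope = ols_slope(standardize(x), standardize(y))
r = pearson_r(x, y)

print("raw slope:", raw_slope)
print("standardized slope:", std_slope)
print("correlation r:", r, "R-squared:", r ** 2)
```

The raw slope depends on the units of both variables (it equals r times the ratio of the two standard deviations), which is why slopes from unstandardized and standardized fits should be compared through r or R-squared rather than directly.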

From the three OLS regression models, we find that although all three graphs show a negative slope, the R-squared values for all three models are very small.

Next, we will further investigate the correlation between each of the three factors and the total crime aggregate using index values provided in the dataset. The index for each variable provides a view of the relative proportion of the value with respect to the entire population. For example, the overall crime index is based on the crime rate per fixed number of residents (e.g. 10,000) for all crimes in a specific area. Compared to the linear pattern in the scatterplots above, we now observe an even stronger negative correlation between each of the three factors and the total crime index, meaning that a one-unit increase in the index of health care, education, or home security is associated with a decrease in the total crime index. Thus, plotting with the index values makes the relationship more distinct.

We also fit three OLS regressions: health care index vs. total crime index, education index vs. total crime index, and home security index vs. total crime index. We found that the slopes are more negative and the R-squared values are higher than in the previous models. This suggests that the models using index values fit the data better than the previous two sets of models.

The above nine scatterplots all display a consistently negative correlation between each of the three factors (health care, education, and home security system service spending) and the crime rate. As a result, it’s reasonable for us to draw the conclusion that the more spending on health care, education, and home security system in a valid FSIP area in San Diego County, the less total crime aggregate in that same area.

While we have reached the general conclusion as stated above, it is even better for us to further investigate how each of the three factors (health care, education, and home security system service spending) contributes to the nine different crime aggregates (i.e. murder, personal crime, rape, robbery, assault, property crime, burglary, larceny, and motor vehicle theft). To accomplish this, we will use line plots.

Line charts are best when we want to show how certain value changes over time or compare how several things change over time relative to each other. That being said, since we are trying to investigate how the nine different crime aggregates are impacted by health care spending, education spending, and home security system spending, we can take advantage of the nice properties of the line chart. We can set each of the spending factors as the x-axis and the crime aggregates as the y-axis, thereby plotting nine lines with different colors (each representing a specific type of crime) on the chart.

However, we will encounter a problem while plotting the line chart. Since we have a relatively large dataset, it contains tons of unique spending values, which represent the x values in the line chart. As a result, if we use our original data to plot the line chart, it will almost look like an abnormal ECG with more than 500 data points connecting together. In other words, the chart will be full of details such that we are unable to perceive the overall trend of each line at all.

To solve this problem, we decide to write a function called split_crime that rounds the values on the x-axis into bins and thereby reduces the number of distinct x values. To put it more simply, we combine several x values that are close to each other so that they share the same position on the x-axis. To achieve this, we subtract the mean from each x value, divide by a quarter of the standard deviation, add 0.5, round down to an integer, multiply by a quarter of the standard deviation, and finally add the mean back. Each x value is thus snapped to the nearest point on a grid whose spacing is one quarter of a standard deviation.
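The binning step can be sketched as below. This is a hedged reconstruction from the verbal description, not the notebook's actual split_crime: the rounding-down step (math.floor) is an assumption, since without it the described arithmetic would only shift every value by a constant, and the spending values are made up.

```python
import math
from statistics import mean, stdev


def split_crime(values, bins_per_sd=4):
    """Snap each value to a grid centered on the mean with spacing sd / bins_per_sd."""
    m = mean(values)
    width = stdev(values) / bins_per_sd
    # floor(t + 0.5) rounds t to the nearest integer grid step.
    return [math.floor((v - m) / width + 0.5) * width + m for v in values]


# Hypothetical spending values, for illustration only.
spending = [1000, 1010, 1020, 1500, 1510, 2000, 2010, 2500]
binned = split_crime(spending)

print(len(set(spending)), "distinct x values before binning")
print(len(set(binned)), "distinct x values after binning")
```

When several rows share the same binned x value, sns.lineplot aggregates their y values (by the mean, by default) and draws a single point per bin, which is what smooths the resulting line.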

After rescaling the x values, we apply the sns.lineplot command to plot the line chart for each of the three factors (health care, education, and home security system service spending) over the nine different crime aggregates. As shown below, all three charts share similar patterns: crimes including motor vehicle theft, robbery, assault, personal crime, and murder all reach their peak aggregates at approximately the same place (around 6,000,000 dollars of spending on health care, around 2,000,000 dollars of spending on education, and around 27,000 dollars of spending on home security system service) and then gradually decrease as spending increases. This finding matches our conclusion drawn from the scatterplots (the more spending on health care, education, and home security system in a valid FSIP area in San Diego County, the less total crime aggregate in that same area).

To further investigate the impact of each of the three factors (health care, education, and home security system service spending) on the nine different crime aggregates, we perform the same visual analysis (i.e. plotting line charts) using the nine crime index values. To correspond to the crime index, we used the index of health care, the index of education, and the index of security system service. This time, we found a much stronger decreasing trend between our factors and the crime index. Similar to the previous three graphs, we found that motor vehicle theft yields the highest crime index in all three graphs, and murder yields the lowest. Overall, when health care, education, and home security system spending increases, all nine types of crime index decrease, although the decrease is gradual.

Other than drawing the scatterplots, we also applied ordinary least squares regression to compute the slope and p-value. We set the significance level at 5% and take as the null hypothesis that the slope is 0. In other words, if the p-value is less than 5%, we have sufficient evidence to reject the null hypothesis and thus claim that there is a correlation between the dependent variable and the independent variable. Since we are interested in whether education, healthcare, and home security are correlated with the crime rate, we first study whether each variable on its own is correlated with the crime rate.
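The decision rule described above can be illustrated with a short sketch. The data are simulated (a true slope of -0.2 plus noise) and are not the project's data. As a simplifying assumption, the p-value uses a normal approximation to the test statistic; OLS summaries such as the one produced by statsmodels use the t distribution, which gives nearly the same answer for a few hundred observations.

```python
import random
from statistics import NormalDist, mean


def slope_test(x, y):
    """Return the OLS slope and an approximate two-sided p-value for H0: slope = 0."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residual_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    # Standard error of the slope in simple linear regression.
    std_err = (residual_ss / (n - 2) / sxx) ** 0.5
    z = slope / std_err
    # Normal approximation to the t distribution (assumption; fine for large n).
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return slope, p_value


# Simulated standardized data, for illustration only.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [-0.2 * a + rng.gauss(0, 1) for a in x]

slope, p_value = slope_test(x, y)
alpha = 0.05  # significance level
reject_null = p_value < alpha
print("slope:", slope, "p-value:", p_value, "reject H0:", reject_null)
```

A small p-value only says that the slope is unlikely to be exactly zero; it says nothing about how much of the variation in crime the model explains, which is what R-squared measures.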

Since the p-value for each regression model is less than 5%, the models suggest that there are correlations between standardized crime and standardized healthcare, standardized education, and standardized home security, respectively. From the summary, the slopes for the three regression models are approximately -0.20, meaning there is a negative correlation. Besides, increasing the independent variable by 1 standard deviation would decrease the dependent variable by 17% of a standard deviation. Thus, these models are statistically and economically significant. However, the R-squared in the three models is around 0.03, meaning only about 3% of the variance in crime is explained by these models. This makes sense because many factors contribute to the crime rate, so using only one factor to determine the crime rate by linear regression is not plausible. In conclusion, though the models show slight negative correlations between standardized crime and standardized healthcare, standardized education, and standardized home security, respectively, a linear regression model with a single independent variable is not a good model for predicting the crime rate.

We then run linear regressions of the crime rate index on the healthcare index, education index, and home security index, respectively. We found that the p-values for all three slopes are approximately 0, meaning that we can confidently reject the null hypothesis. The regression models suggest the slopes are around -0.4 to -0.6, indicating weak negative correlations in our models. Besides, increasing the independent variable by 1 standard deviation would decrease the dependent variable by 45%. Thus, these models are statistically and economically significant. Compared with the previous tests, in which we used standardized aggregate data, we now have an R-squared of around 20%. Thus, it seems more reasonable to build our model on index data than on standardized data. We then take two of the three variables (healthcare, education, and home security) at a time and investigate whether any combination of two independent variables influences the crime rate.

The p-values for the regression models t_standardized_crime ~ t_standardized_healthcare + t_standardized_education and t_standardized_crime ~ t_standardized_education + t_standardized_home_security are larger than 5%, so we do not have enough evidence to reject the null hypothesis. Thus we cannot infer any correlations from these two regression models. The last model, t_standardized_crime ~ t_standardized_home_security + t_standardized_healthcare, does suggest correlations between standardized crime and both standardized healthcare and standardized home security. From the summary, the p-values for both home security and healthcare are approximately 0, so there is a correlation between crime, healthcare, and home security. We also find that the slope of home security is -1.133 and the slope of healthcare is 0.9466. Since the R-squared is 0.078, which means only 7.8% of the variance in crime is explained by this model, this regression model cannot predict the real-world crime rate well.

We then fit the same regression models on the index values. In each regression, there is only one variable with a p-value of approximately 0 and one variable whose p-value is larger than 5%. Specifically, the p-value of health care in either model is approximately 0, suggesting there is a correlation between the health care index and the crime rate index. Since none of the regression models have p-values near 0 for both variables, there is little meaningful analysis in this part.

Although the p-value of the intercept in the regression model t_standardized_crime ~ t_standardized_healthcare + t_standardized_education + t_standardized_home_security is equal to 1, this is expected for standardized variables, whose fitted intercept is zero, and it does not affect the analysis. Since all other p-values are less than 5%, the model suggests that there is a correlation between standardized crime and standardized healthcare, standardized education, and standardized home security. From the summary, the slope for t_standardized_healthcare is 1.8137, indicating a positive correlation between the crime rate and health spending. Meanwhile, the slopes for the other two variables range from -0.5789 to -1.4491, indicating a negative correlation. The positive sign on healthcare, which had a negative slope on its own, may reflect strong correlation among the three spending variables rather than a genuine positive effect. Overall, since there are both positive and negative correlations between the crime rate and the three variables, the model does not suggest a strong one-directional (negative) relationship between our variables. In addition, the R-squared in this model is 0.098, meaning only about 10% of the variance in crime is explained by this model. In conclusion, this linear regression model with three independent variables only suggests a very weak negative correlation between our independent variables and the dependent variable.

We then run the linear regression of the crime rate index on index_health_care, index_education, and index_home_security. Compared to the linear regression model for standardized variables, R-squared is 15% larger, which means our index model explains more of the variation in the data. We found that the p-values for all three slopes are approximately 0, meaning that we have strong evidence that the slopes differ from zero. We can see that index_education and index_home_security have a weak negative relationship with the crime index, while index_health_care has a weak positive relationship with the crime index. A one-unit increase in index_education is associated with a decrease of 0.3174 in the crime index on average; a one-unit increase in index_home_security is associated with a decrease of 1.0730 on average; a one-unit increase in index_health_care is associated with an increase of 0.9732 on average. Though the model is statistically and economically significant, the mixed signs of these weak relationships indicate that there might not be a strong negative correlation between our independent variables and the dependent variable.

From all of the above linear regression models, we have found both positive and negative correlations between our independent and dependent variables, with the slope ranging from -1.4491 to 1.8137. Since all observed slopes are relatively close to 0, the models fail to suggest a strong one-directional relationship. Besides, the R-squared values in all linear regression models are extremely small. Therefore, applying the linear regression model alone is not sufficient to explain the actual relationship between the variables of our interest. For further analysis, we decide to use scikit-learn (sklearn) to see whether other models predict our data more accurately. Specifically, we will use linear regression, support vector regression with a radial basis function kernel, support vector regression with a polynomial kernel, ridge regression, and Poisson regression to predict the crime rate.

In the following scripts, we first split the data into training and testing sets, 80% and 20% respectively. We then use scikit-learn to fit different regression models on the training data, including linear regression, RBF regression, polynomial regression, ridge regression, and Poisson regression. We plotted three blocks of regression models (with independent variables index_heath_care, index_education, and index_home_security_system_svcs), each block with plots on both the training set and the testing set.
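The 80/20 split can be sketched with the standard library as below. The rows are made-up (index value, crime index) pairs, and the helper name and seed are hypothetical. In a scikit-learn workflow the same step is usually done with train_test_split(X, y, test_size=0.2, random_state=...), which is presumably what the notebook uses.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows reproducibly and split them into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical (index_education, index_crime) pairs, for illustration only.
rows = [(i, 100 - i) for i in range(50)]
train, test = split_train_test(rows)

print(len(train), "training rows,", len(test), "testing rows")
```

Fixing the seed makes the split reproducible, so the reported scores do not change every time the notebook is rerun.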

For the plot of index_health_care vs. index_crime, none of the regression models represents our training data well because the data points are widely spread, although the RBF regression model fits the training data best. All regression models display a decreasing trend, indicating that as spending on health care increases, the predicted crime rate decreases. This matches what we observed in the OLS regression results (a very weak negative relationship between the independent and dependent variables), but all of the regression models clearly have low accuracy on the training data. The plot on the testing data shows the same decreasing trend for all models. However, since the testing set contains only 20% of the data, it has far fewer points, and the predictions of our regression models are even less accurate there. The Accuracy Report below the plots confirms this: the accuracy scores for all regression models are negative, ranging from -0.22 to -0.11, and the R2 scores are also negative, ranging from approximately -0.29 to -0.20. The mean squared error values for all regression models are also extremely high (around 1,000). Since the models are inaccurate on both the training and testing data (accuracy and R2 scores below 0, and mean squared error of about 1,000), we conclude that there is no clear relationship between health care spending and crime rate.

For the plot of index_education vs. index_crime, none of the regression models represents our training data well because the data points are widely spread, although the RBF regression model fits the training data best. All regression models display a decreasing trend, indicating that as spending on education increases, the predicted crime rate decreases. This matches what we observed in the OLS regression results (a very weak negative relationship between the independent and dependent variables), but all of the regression models clearly have low accuracy on the training data. The plot on the testing data shows the same decreasing trend for all models. However, since the testing set contains only 20% of the data, it has far fewer points, and the predictions of our regression models are even less accurate there. The Accuracy Report below the plots confirms this: the accuracy scores for all regression models are negative, ranging from -0.21 to -0.12, and the R2 scores are also negative, ranging from approximately -0.25 to -0.11. The mean squared error values for all regression models are also extremely high (around 1,000). Since the models are inaccurate on both the training and testing data (accuracy and R2 scores below 0, and mean squared error of about 1,000), we conclude that there is no clear relationship between education spending and crime rate.

For the plot of index_home_security_system_svcs vs. index_crime, none of the regression models represents our training data well because the data points are widely spread, although the RBF regression model fits the training data best. All regression models display a decreasing trend, indicating that as spending on home security system services increases, the predicted crime rate decreases. This matches what we observed in the OLS regression results (a very weak negative relationship between the independent and dependent variables), but all of the regression models clearly have low accuracy on the training data. The plot on the testing data shows the same decreasing trend for all models. However, since the testing set contains only 20% of the data, it has far fewer points, and the predictions of our regression models are even less accurate there. The Accuracy Report below the plots confirms this: the accuracy scores for all regression models are negative, ranging from -0.20 to -0.12, and the R2 scores are also negative, ranging from approximately -0.35 to -0.24. The mean squared error values for all regression models are also extremely high (around 1,000). Since the models are inaccurate on both the training and testing data (accuracy and R2 scores below 0, and mean squared error of about 1,000), we conclude that there is no clear relationship between home security system services spending and crime rate.
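
All three single-variable reports show negative R2 scores, so it helps to spell out what a negative value means. The sketch below uses the standard definitions of mean squared error and R2 on a small made-up example (the numbers are not from our dataset): an R2 of 0 corresponds to always predicting the mean of the observed values, and a negative R2 means the model does worse than that baseline.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up crime index values for four areas (illustration only).
y_true = [80.0, 100.0, 120.0, 140.0]

# Baseline: always predict the mean of the observed values (110).
baseline = [110.0, 110.0, 110.0, 110.0]

# A poor model whose predictions move in the wrong direction.
poor_model = [130.0, 90.0, 150.0, 70.0]

print(r2_score(y_true, baseline))      # the mean baseline has R2 of exactly 0
print(r2_score(y_true, poor_model))    # below 0: worse than the mean baseline
print(mean_squared_error(y_true, poor_model))
```

For scale, a mean squared error of about 1,000 corresponds to a root mean squared error of roughly 32 index units.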

Similar to the previous analysis, we then choose two of the three variables (health care, education, and home security) and build regression models on each combination of two independent variables. As before, we split our data into 80% training data and 20% testing data. For convenience, we wrote a function that shows three 3D plots: the distribution of the data, the training data with its prediction surface, and the testing data with its prediction results.
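
As a rough illustration of what a prediction surface is, the sketch below enumerates the three pairs of predictors and fits a plane to each pair with ordinary least squares. It is a simplified stand-in for our plotting function: the data are synthetic, the train/test split and the non-linear models are omitted for brevity, and `fit_plane` and `prediction_surface` are hypothetical helper names.

```python
from itertools import combinations

import numpy as np

PREDICTORS = [
    "index_health_care",
    "index_education",
    "index_home_security_system_svcs",
]

# Synthetic stand-in data (an assumption), one value per area.
rng = np.random.default_rng(1)
data = {name: rng.normal(100.0, 15.0, size=441) for name in PREDICTORS}
crime = rng.normal(100.0, 30.0, size=441)


def fit_plane(x1, x2, y):
    """Least-squares fit of y = b0 + b1 * x1 + b2 * x2; returns [b0, b1, b2]."""
    design = np.column_stack([np.ones_like(x1), x1, x2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def prediction_surface(coef, x1, x2, steps=20):
    """Evaluate the fitted plane on a regular grid spanning the observed data."""
    grid1, grid2 = np.meshgrid(
        np.linspace(x1.min(), x1.max(), steps),
        np.linspace(x2.min(), x2.max(), steps),
    )
    return grid1, grid2, coef[0] + coef[1] * grid1 + coef[2] * grid2


# One surface per pair of predictors: three pairs in total.
surfaces = {}
for first, second in combinations(PREDICTORS, 2):
    coef = fit_plane(data[first], data[second], crime)
    surfaces[(first, second)] = prediction_surface(coef, data[first], data[second])

for pair in surfaces:
    print(pair)
```

The three arrays returned for each pair are in the form that a 3D surface plot (for example, matplotlib's `plot_surface`) expects.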

We first analyze index health care and index education as joint predictors of the crime index. In the first graph, the data points are scattered with no visible pattern. In the second graph, the regression prediction surfaces pass through many data points, but a large number of points still lie off the surfaces. In the last graph, fewer points lie off the surfaces, but the graph also shows fewer points overall because it contains only the 20% of the data used for testing. The accuracy report agrees with this reading: every model has an R2 score below 0, a huge mean squared error (around 1,000), and low accuracy. Thus, we claim that there is no relationship between the crime index and the combination of index health care and index education.

We then analyze index home security and index education as joint predictors of the crime index. In the first graph, the data points are scattered with no visible pattern. In the second graph, the regression prediction surfaces pass through many data points, but a large number of points still lie off the surfaces. In the last graph, fewer points lie off the surfaces, but the graph also shows fewer points overall because it contains only the 20% of the data used for testing. The accuracy report agrees with this reading: every model has an R2 score below 0, a huge mean squared error (around 1,000), and low accuracy. Thus, we claim that there is no relationship between the crime index and the combination of index home security and index education.

Finally, we analyze index home security and index health care as joint predictors of the crime index. In the first graph, the data points are scattered with no visible pattern. In the second graph, the regression prediction surfaces pass through many data points, but a large number of points still lie off the surfaces. In the last graph, fewer points lie off the surfaces, but the graph also shows fewer points overall because it contains only the 20% of the data used for testing. The accuracy report agrees with this reading: every model has an R2 score below 0, a huge mean squared error (around 1,000), and low accuracy. Thus, we claim that there is no relationship between the crime index and the combination of index home security and index health care.

To further visualize the correlation between the education index, healthcare index, home security index, and total crime index, we draw a 3D graph in which the x, y, and z axes represent the education index, healthcare index, and home security index, and color represents the total crime index. In the graph, the data points are scattered with no visible pattern, and we cannot find a specific trend among them. The accuracy reports support this observation: every model has an R2 score below 0 and a huge mean squared error (over 1,000), and the accuracy of each model is very low. Thus, we conclude that there is no relationship between the education index, healthcare index, and home security index on the one hand and the total crime rate index on the other.

Ethics & Privacy

1) The Question

2) The Implications

3) The Data

4) Informed Consent

5) Privacy

6) Evaluation

7) Analysis

8) Transparency and Appeal

9) Continuous Monitoring

Conclusion & Discussion

Recall our research question:

Is there a relationship between health, education, and house security spending and all different categories of crime rates in all 441 valid FSIP areas in San Diego County?

And hypothesis:

The higher the average household health / education / home security system spending a sector has, the lower the crime rate that sector has.

We now draw an overall conclusion on the relationship between health care, education, and home security system spending and the crime rate. In the above analysis, we created regression models on different variables using ordinary least squares and visualized the relationship between them using scatter plots. However, across all of the linear regression models, we found both positive and negative correlations between our independent and dependent variables, indicating that there does not appear to be a clear one-directional relationship. Thus, for further analysis, we used sklearn to create different regression models, including linear regression, radial basis function from support vector regression, polynomial from support vector regression, ridge regression, and Poisson regression, to predict the relationship between our variables of interest. Unfortunately, we obtained negative accuracy and R2 scores for all of these regression models, as well as relatively high mean squared error (about 1,000). Therefore, we cannot derive a relationship from the above analyses. We have reached the conclusion that there is no significant relationship between health care, education, and home security spending and the crime rate.

Limitation of our model:

Our analysis concluded that there are no clear relationships between our independent variables and dependent variables. However, our research has some unavoidable limitations. Firstly, since San Diego County is a relatively small area, we only have a limited amount of data, which might make our models unrepresentative of the true relationship between the independent variables and the dependent variable. Secondly, our OLS model might not be fully accurate due to omitted variable bias. The bias exists because other variables affect the crime rate and are at the same time correlated with education, health, or home security spending. Those outside variables might move in the same direction as our independent variables or in the opposite direction, causing our OLS models to overestimate or underestimate the true effect of each variable. The dependent variables and independent variables might also be interrelated, causing the slope coefficients to be inaccurate. Thirdly, we cannot determine whether our conclusion involves a type 1 error or a type 2 error. Although some of our models indicate a weak negative relationship between our independent variables and dependent variables, our R-squared is fairly low, which means the majority of the variation in our data cannot be explained by the OLS models.
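
The omitted variable bias mentioned above can be illustrated with a small simulation. Everything in it is made up for illustration: `income` stands for a hypothetical omitted variable, and the coefficients (a true spending effect of -0.1 and an income effect of -1.0) are assumptions, not estimates from our data.

```python
import random


def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


random.seed(0)
n = 5000
true_effect = -0.1  # assumed true effect of spending on crime

# Hypothetical omitted variable that drives both spending and crime.
income = [random.gauss(0.0, 1.0) for _ in range(n)]
spending = [0.8 * z + random.gauss(0.0, 0.6) for z in income]
crime = [
    true_effect * s - 1.0 * z + random.gauss(0.0, 1.0)
    for s, z in zip(spending, income)
]

# Regressing crime on spending alone attributes the income effect to spending.
naive_slope = ols_slope(spending, crime)
print(f"true effect: {true_effect}, naive OLS slope: {naive_slope:.3f}")
```

In this setup the naive slope lands near -0.9 rather than -0.1, because the regression without `income` credits spending with the effect of the variable it is correlated with. With a different sign on the omitted variable's effect, the bias would go in the other direction.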

Team Contributions

Team Expectations

Project Timeline Proposal

Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting
1/20 | 10 AM | Read & think about COGS 108 expectations; brainstorm topics/questions | Determine best form of communication; discuss and decide on final project topic; discuss hypothesis; begin background research
1/26 | 2 PM | Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal
2/1 | 9 AM | Search for datasets; data cleaning | Find OLS model
2/14 | 8 PM | Try to analyze the OLS model | Finish OLS model and write analysis
2/23 | 8 PM | Finish OLS | Try to find more models
3/13 | 3 PM | Finish the final project | Complete video and PowerPoint
3/14 | Before 11:59 PM | NA | Turn in Final Project & Group Project Surveys